A Contextual Post-processing Model for Korean OCR using Synthesized Statistical Information
Abstract
In this paper, we describe a contextual Korean OCR post-processing model that takes unknown words into account. This work starts from the following premises: 1) in a language with a very large character set, it is hard to correct an erroneous string directly; 2) word formation is closely related not only to morphological features but also to phonological features (especially syllable combination at the surface level); 3) an OCR post-processing system must rely on lexical and contextual information to correct OCR errors accurately. Based on these premises, the proposed system is composed of the following three modules. First, it generates candidate words and evaluates the confidence of each candidate. Second, it selects feasible candidates by evaluating the word possibility of each candidate from its syllable connectivity and morphotactic connectivity. Third, it analyzes the contextual association among words and selects the most feasible word using statistical information that synthesizes the confidence weight, word possibility, and contextual association strength of each candidate. Experimental results show that the proposed system can improve the performance of an OCR system from 94.1% to 97.6% at the character level and from 87.6% to 97.1% at the word level. In addition, the system can recognize 87.4% of unknown words.
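The three-module flow described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual scoring formula, so the weighted log-linear combination below, the filtering threshold, and all names (`Candidate`, `synthesized_score`, `select_word`) are assumptions made for illustration only, and the three per-candidate scores are taken as given rather than computed from a real OCR engine or corpus.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical candidate word with the three scores named in the abstract."""
    word: str
    confidence: float        # OCR-derived confidence weight, assumed in (0, 1]
    word_possibility: float  # syllable/morphotactic plausibility, assumed in (0, 1]
    context_strength: float  # association with neighbouring words, assumed in (0, 1]


def synthesized_score(c, weights=(1.0, 1.0, 1.0)):
    """Combine the three scores log-linearly (an assumed form, not the paper's)."""
    w_conf, w_word, w_ctx = weights
    return (w_conf * math.log(c.confidence)
            + w_word * math.log(c.word_possibility)
            + w_ctx * math.log(c.context_strength))


def select_word(candidates, min_word_possibility=0.1):
    """Filter implausible candidates, then pick the best combined score."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    # Module 2 (assumed form): drop candidates with a low word possibility.
    feasible = [c for c in candidates if c.word_possibility >= min_word_possibility]
    if not feasible:
        feasible = list(candidates)  # fall back rather than return nothing
    # Module 3 (assumed form): choose the highest synthesized score.
    return max(feasible, key=synthesized_score)


# Toy scores, invented for illustration.
candidates = [
    Candidate("cand_a", confidence=0.60, word_possibility=0.05, context_strength=0.9),
    Candidate("cand_b", confidence=0.50, word_possibility=0.80, context_strength=0.7),
    Candidate("cand_c", confidence=0.55, word_possibility=0.60, context_strength=0.3),
]
best = select_word(candidates)
print(best.word)
```

In this toy run, `cand_a` has the highest OCR confidence but is filtered out for its low word possibility, and `cand_b` beats `cand_c` on the combined score. This mirrors the abstract's point that recognizer confidence alone is not enough to pick the correct word.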
Similar resources
Multi-level post-processing for Korean character recognition using morphological analysis and linguistic evaluation
Most post-processing methods for character recognition rely on contextual information at the character and word-fragment levels. However, due to the linguistic characteristics of Korean, such low-level information alone is not sufficient for high-quality character-recognition applications, and much higher-level contextual information is needed to improve the recognition results. This paper present...
An OCR Post-processing Approach Based on Multi-knowledge
This paper proposes an OCR post-processing approach based on multi-knowledge, which integrates language knowledge and the candidate distance information given by the OCR engine. In this approach, a statistical language model and a semantic lexicon are combined, and candidate distance information is used to reduce the size of the search space. The experimental results show that this approach is very eff...
A post-processor for Gurmukhi OCR
A post-processing system for OCR of Gurmukhi script has been developed. Statistical information on Punjabi syllable combinations, corpus look-up, and certain heuristics based on Punjabi grammar rules have been combined to design the post-processor. An improvement of about 3 percentage points in recognition rate, from 94.35% to 97.34%, has been reported on clean images using the post-processing techniques.
Statistical Learning for OCR Text Correction
The accuracy of Optical Character Recognition (OCR) is crucial to the success of subsequent applications in a text analysis pipeline. Recent models of OCR post-processing significantly improve the quality of OCR-generated text, but are still prone to suggesting correction candidates from limited observations while insufficiently accounting for the characteristics of OCR errors. In this paper, ...
Efficient OCR Post-Processing Combining Language, Hypothesis and Error Models
In this paper, an OCR post-processing method that combines a language model, OCR hypothesis information and an error model is proposed. The approach can be seen as a flexible and efficient way to perform Stochastic Error-Correcting Language Modeling. We use Weighted Finite-State Transducers (WFSTs) to represent the language model, the complete set of OCR hypotheses interpreted as a sequence of ...
Publication date: 2007